AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy
Agarwal, Shyam, Moghimi, Ali, Haudek, Kevin C.
Constructed-response questions are crucial for encouraging generative processing and testing a learner's understanding of core concepts. However, limited instructor time, large class sizes, and other resource constraints make it difficult to provide the timely and detailed evaluation that a holistic educational experience requires. Providing frequent assessments is equally challenging, since manual grading is labor-intensive and automated grading is hard to generalize to every possible response. This paper proposes a novel and practical approach to grading short-answer constructed-response questions. We discuss why this problem is challenging, define the class of questions on which our method works, and propose a framework that instructors can use to evaluate their students' open responses by leveraging near-domain data, such as responses to similar questions administered in previous years. The proposed method outperforms state-of-the-art machine learning models as well as non-fine-tuned large language models like GPT-3.5, GPT-4, and GPT-4o by a considerable margin (over 10-20% in some cases), even after the LLMs are provided with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term the accuracy and data advantages of human-labeled data, and we believe this is the first work to formalize the problem of automated short-answer grading from near-domain data.
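The abstract's core idea of reusing labeled responses from a similar prior question can be illustrated with a toy sketch. This is a minimal nearest-neighbor grader over bag-of-words Jaccard similarity, purely to show the shape of near-domain transfer; it is not the paper's actual model, and the example responses and labels are made up.

```python
# Hypothetical sketch: grade a new response by copying the label of the
# most similar labeled response to a *similar* question from a prior year.
def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    union = len(a | b) or 1
    return len(a & b) / union

def grade(response, labeled_near_domain):
    """Assign the label of the most lexically similar near-domain response."""
    toks = tokens(response)
    best = max(labeled_near_domain, key=lambda item: jaccard(toks, tokens(item[0])))
    return best[1]

# Labeled answers to a similar question administered previously (invented data)
near_domain = [
    ("osmosis moves water across the membrane toward higher solute", "correct"),
    ("the cell pushes water out using energy", "incorrect"),
]
print(grade("water crosses the membrane by osmosis toward solute", near_domain))
# → correct
```

A real system would replace the Jaccard step with a fine-tuned model, but the data flow, from labeled near-domain responses to predictions on new ones, is the same.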
Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
Dayarathne, Ranul, Ranaweera, Uvini, Ganegoda, Upeksha
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. Thus, the increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) across diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct, and Orca-mini-v3-7b, and OpenAI's widely used GPT-3.5 on QA tasks over the computer science literature, leveraging RAG support. The evaluation metrics are accuracy and precision for binary questions, and ranking by a human expert, ranking by Google's Gemini model, and cosine similarity for long-answer questions. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Among the open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b reports the shortest average latency in generating responses, whereas Meta's LLaMa2-7b-chat reports the highest. This research underscores that, given adequate infrastructure, open-source LLMs can go hand in hand with proprietary models like GPT-3.5.
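The RAG setup the study evaluates can be reduced to two steps: retrieve the most relevant passage, then condition the LLM's prompt on it. The sketch below shows those two steps with plain term-frequency cosine similarity and a stubbed prompt; real pipelines use dense embeddings and an actual LLM call, and the corpus here is invented.

```python
import math
from collections import Counter

def tf_vector(text):
    # Simple term-frequency vector (real RAG systems use dense embeddings)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, corpus, k=1):
    qv = tf_vector(question)
    ranked = sorted(corpus, key=lambda p: cosine(qv, tf_vector(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    # The prompt that would be sent to the LLM (call itself omitted)
    context = "\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

corpus = [
    "Mistral-7b-instruct is an open-source instruction-tuned model.",
    "Cosine similarity compares the angle between two term vectors.",
]
top = retrieve("What does cosine similarity compare?", corpus)
print(build_prompt("What does cosine similarity compare?", top))
```

The same cosine-similarity function also underlies the study's long-answer evaluation metric, where it compares a generated answer against a reference answer.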
Safer in Translation? Presupposition Robustness in Indic Languages
Palnitkar, Aadi, Suresh, Arjun, Rajesh, Rishi, Puli, Puneet
Increasingly, people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of LLM responses to such queries. While pre-existing medical benchmark literature seeks to accomplish this very task, these benchmarks are almost universally in English, leaving a notable gap in the literature on multilingual LLM evaluation. In this work, we help address this gap with Cancer-Myth-Indic, an Indic-language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages from the subcontinent (500 per language; 2,500 translated items total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress.
Scaling Truth: The Confidence Paradox in AI Fact-Checking
Qazi, Ihsan A., Khan, Zohaib, Ghani, Abdullah, Raza, Agha A., Qazi, Zafar A., Sajjad, Wassay, Ali, Ayesha, Javaid, Asher, Sohail, Muhammad Abdullah, Azeemi, Abdul H.
The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.
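The confidence-accuracy pattern the abstract describes falls out of a simple aggregation over per-claim verdicts. The sketch below computes accuracy and mean self-reported confidence per model; the model names and numbers are illustrative only, not the paper's data.

```python
# Hypothetical sketch: aggregate per-claim fact-check verdicts into the
# accuracy vs. mean-confidence summary used to surface the pattern above.
def summarize(records):
    """records: list of (model, correct: bool, confidence: float in [0, 1])"""
    stats = {}
    for model, correct, conf in records:
        acc_sum, conf_sum, n = stats.get(model, (0, 0, 0))
        stats[model] = (acc_sum + correct, conf_sum + conf, n + 1)
    return {m: {"accuracy": a / n, "mean_confidence": c / n}
            for m, (a, c, n) in stats.items()}

# Invented records showing the Dunning-Kruger-like shape: the small model
# is confident but often wrong; the large model is accurate but hedges.
records = [
    ("small-model", False, 0.95), ("small-model", True, 0.90),
    ("large-model", True, 0.60), ("large-model", True, 0.65),
]
print(summarize(records))
```

Plotting accuracy against mean confidence per model makes the miscalibration gap directly visible.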
Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback
Rashkin, Hannah, Clark, Elizabeth, Huot, Fantine, Lapata, Mirella
Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects -- providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
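Deliberately corrupting clean stories, as the test set above does, can be sketched as a transformation that injects a known writing issue. The function below scrambles sentence order to create an incoherent-structure flaw; this is one invented example of a corruption, since the abstract does not specify the paper's actual corruption types.

```python
import random

# Hypothetical sketch: corrupt a story by shuffling its sentences,
# producing a known "structure" flaw a feedback model should catch.
def corrupt_story(story, seed=0):
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    rng = random.Random(seed)  # seeded so corruptions are reproducible
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

story = "She opened the door. The room was empty. A note lay on the floor."
print(corrupt_story(story))
```

Because the injected issue is known, a model's feedback can be scored automatically on whether it identifies that specific flaw.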
Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?
McGlinchey, Andrea, Barclay, Peter J
Large language models (LLMs) can produce convincing 'fake text' in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless 'arms race', we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models' ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify 'fake text' in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.
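The "relatively simple classifiers" the abstract mentions typically operate on stylometric features rather than deep representations. The sketch below computes two classic such features; the feature choice is illustrative and not taken from the paper.

```python
# Illustrative sketch of stylometric features a lightweight fake-text
# classifier might use; a real detector would feed many such features
# into a trained model (e.g., logistic regression).
def stylometric_features(text):
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    # Type-token ratio: vocabulary diversity, often differs between
    # human and machine text
    type_token_ratio = len(set(w.lower() for w in words)) / max(len(words), 1)
    return avg_sentence_len, type_token_ratio

sample = "It was a dark night. The inspector paused. Something was wrong."
print(stylometric_features(sample))
```

Features like these are cheap to compute, which is exactly why detection can stay viable while generators grow ever more resource-hungry.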
When LLM Therapists Become Salespeople: Evaluating Large Language Models for Ethical Motivational Interviewing
Large language models (LLMs) have been actively applied in the mental health field. Recent research shows the promise of LLMs in applying psychotherapy, especially motivational interviewing (MI). However, there is a lack of studies investigating how language models understand MI ethics. Given the risk that malicious actors could use language models to apply MI for unethical purposes, it is important to evaluate their capability to differentiate ethical from unethical MI practices. Thus, this study investigates the ethical awareness of LLMs in MI through multiple experiments. Our findings show that LLMs have a moderate to strong level of knowledge of MI. However, their ethical standards are not aligned with the MI spirit, as they generated unethical responses and performed poorly in detecting unethical responses. We propose a Chain-of-Ethic prompt to mitigate those risks and improve safety. Our proposed strategy effectively improved ethical MI response generation and detection performance. These findings highlight the need for safety evaluations and guidelines for building ethical LLM-powered psychotherapy.
LimTopic: LLM-based Topic Modeling and Text Summarization for Analyzing Scientific Article Limitations
Azhar, Ibrahim Al, Reddy, Venkata Devesh, Alhoori, Hamed, Akella, Akhil Pandey
The limitations sections of scientific articles play a crucial role in highlighting the boundaries and shortcomings of research, thereby guiding future studies and improving research methods. Analyzing these limitations benefits researchers, reviewers, funding agencies, and the broader academic community. We introduce LimTopic, a strategy for generating topics from the limitations sections of scientific articles with Large Language Models (LLMs), where each topic consists of a title and a topic summary. This study focuses on effectively extracting and understanding these limitations through topic modeling and text summarization, utilizing the capabilities of LLMs. We extracted limitations from research articles and applied LLM-based topic modeling integrated with the BERTopic approach to generate a title and topic sentences for each topic. To enhance comprehension and accessibility, we employed LLM-based text summarization to condense each topic's sentences into a concise, generalizable topic summary. Our experiments involved prompt engineering, fine-tuning the LLM and BERTopic, and integrating BERTopic with the LLM to generate topics, titles, and topic summaries. We also experimented with various LLMs for both BERTopic-based topic modeling and the text summarization task. Our results show that the combination of BERTopic and GPT-4 performed best in terms of silhouette and coherence scores for topic modeling, and GPT-4 also outperformed the other LLMs as a text summarizer.
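The pipeline's first stage, grouping limitation texts into topics, can be illustrated with a deliberately naive stand-in. The sketch below clusters texts by their most frequent content word as a pseudo-topic; the real system uses BERTopic's embedding-based clustering plus an LLM for titles and summaries, and the stopword list and example texts here are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for BERTopic-style clustering: group limitation
# texts by their dominant content word and use it as a pseudo-topic label.
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in", "our"}

def keyword(doc):
    words = [w.lower().strip(".,") for w in doc.split()]
    content = [w for w in words if w not in STOPWORDS]
    return max(set(content), key=content.count)

def naive_topics(limitation_texts):
    topics = defaultdict(list)
    for doc in limitation_texts:
        topics[keyword(doc)].append(doc)
    return dict(topics)

docs = [
    "the dataset is small and the dataset lacks diversity",
    "sample size limits generalization of the sample",
]
print(naive_topics(docs))
```

In the actual LimTopic pipeline, an LLM would then rewrite each cluster's keyword into a readable title and summarize the cluster's sentences into a topic summary.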
Performance Evaluation of Large Language Models in Statistical Programming
Song, Xinyi, Xie, Kexin, Lee, Lina, Chen, Ruizhe, Clark, Jared M., He, Hao, He, Haoran, Min, Jie, Zhang, Xinlei, Zheng, Simin, Zhang, Zhiyang, Deng, Xinwei, Hong, Yili
The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Despite their growing prominence, a comprehensive evaluation of statistical code generated by LLMs remains scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.
LMN: A Tool for Generating Machine Enforceable Policies from Natural Language Access Control Rules using LLMs
Sonune, Pratik, Rai, Ritwik, Sural, Shamik, Atluri, Vijayalakshmi, Kundu, Ashish
Access control is a fundamental security requirement in any organization for ensuring that only authorized users can access certain information or resources under specific conditions. While enforcement needs to be done in computer systems, access control policies are typically decided by the higher management. For example, in a university system, the Department Chair, Dean, and Provost may decide who can access which object (like conference room printers, graduate studies applications, faculty tenure support letters, etc.) at the Department, School, and University level, respectively. Such decisions are often noted down as meeting minutes, email exchanges, or other forms of documentation in a natural language like English (hereinafter referred to as Natural Language Access Control Policies, i.e., NLACPs). For information-system-level implementation of such decisions, System Security Officers (SSOs) must translate the NLACPs into Machine Enforceable Security Policies (MESPs) using a target access control model like Role-Based Access Control (RBAC) or Attribute-Based Access Control (ABAC). It is apparent that manual conversion of NLACPs into MESPs not only demands time and resources but is also error-prone, especially for large organizations with dynamically changing policies.
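The NLACP-to-MESP translation problem can be made concrete with a toy sketch that maps one simple English sentence onto an RBAC-style triple. This regex-based parser is purely illustrative; a real tool like the LMN system described above uses LLMs precisely because natural-language policies are far messier than this pattern admits.

```python
import re

# Hypothetical sketch: turn one simple English access-control sentence
# into an RBAC-style (role, action, object) triple.
RULE = re.compile(r"(?P<role>[\w\s]+?) can (?P<action>\w+) (?P<object>[\w\s]+)")

def nlacp_to_rbac(sentence):
    m = RULE.match(sentence.strip().rstrip("."))
    if not m:
        return None  # sentence doesn't fit the toy "X can Y Z" pattern
    return (m.group("role").strip(), m.group("action"), m.group("object").strip())

print(nlacp_to_rbac("The Department Chair can access conference room printers."))
# → ('The Department Chair', 'access', 'conference room printers')
```

Sentences with conditions, negations, or attribute constraints ("only during business hours") break patterns like this immediately, which is the gap LLM-based extraction aims to close.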